OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles
Authors
Abstract
We present a new major release of the OpenSubtitles collection of parallel corpora. The release is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages. The release also incorporates a number of enhancements in the preprocessing and alignment of the subtitles, such as the automatic correction of OCR errors and the use of metadata to estimate the quality of each subtitle and score subtitle pairs.
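The core step behind building such bitexts is pairing subtitle cues across languages by their on-screen timing. The following is a minimal, hypothetical sketch of that idea; the cue format, the greedy pairing strategy, and the 0.6 overlap threshold are illustrative assumptions, not the parameters used in the OpenSubtitles2016 pipeline.

```python
# Hypothetical sketch: pairing subtitle cues across two languages by
# temporal overlap, the core signal that time-based subtitle aligners use.
# Each cue is a (start_sec, end_sec, text) tuple; threshold is illustrative.

def overlap_ratio(a, b):
    """Fraction of the shorter cue's duration covered by the overlap."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    overlap = max(0.0, end - start)
    shorter = min(a[1] - a[0], b[1] - b[0])
    return overlap / shorter if shorter > 0 else 0.0

def align_cues(src, tgt, threshold=0.6):
    """Greedily pair cues whose screen time overlaps enough."""
    pairs = []
    j = 0
    for cue in src:
        # Skip target cues that end before this source cue starts.
        while j < len(tgt) and tgt[j][1] <= cue[0]:
            j += 1
        if j < len(tgt) and overlap_ratio(cue, tgt[j]) >= threshold:
            pairs.append((cue[2], tgt[j][2]))
    return pairs

src = [(1.0, 3.0, "Hello."), (4.0, 6.0, "How are you?")]
tgt = [(1.1, 3.1, "Bonjour."), (4.2, 6.1, "Comment ça va ?")]
print(align_cues(src, tgt))
```

In practice a real aligner would also handle one-to-many cue merges and clock drift between subtitle files; this sketch only shows the overlap signal itself.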
Similar Resources
Dual Subtitles as Parallel Corpora
In this paper, we leverage the existence of dual subtitles as a source of parallel data. Dual subtitles present viewers with two languages simultaneously and are generally aligned at the segment level, which removes the need to perform this alignment automatically. This is desirable, as the extracted parallel data does not contain the alignment errors present in previous work that aligns different subt...
Constructing Parallel Corpus from Movie Subtitles
This paper describes a methodology for constructing aligned German-Chinese corpora from movie subtitles. The corpora will be used to train a special machine translation system with the intention of automatically translating subtitles between German and Chinese. Since the common length-based alignment algorithm shows weaknesses on short spoken sentences, especially those from different langua...
Using Movie Subtitles for Creating a Large-Scale Bilingual Corpora
This paper presents a method for compiling a large-scale bilingual corpus from a database of movie subtitles. To create the corpus, we propose an algorithm based on Gale and Church's sentence alignment algorithm (1993). However, our algorithm not only relies on character length information, but also uses subtitle-timing information, which is encoded in the subtitle files. Timing is highly correl...
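The combination described above can be sketched as a per-pair alignment cost that mixes Gale and Church's character-length signal with a timing penalty. This is a hypothetical illustration; the weight `w_time` and the exact functional form are assumptions, not the authors' actual model.

```python
# Hypothetical sketch: alignment cost combining character-length ratio
# (as in Gale & Church, 1993) with subtitle timing distance.
# w_time and the functional form are illustrative assumptions.
import math

def pair_cost(src_len, tgt_len, src_start, tgt_start, w_time=0.5):
    """Lower cost = better candidate pair of subtitle fragments."""
    # Length term: zero when the two fragments have equal character length.
    length_term = abs(math.log((src_len + 1) / (tgt_len + 1)))
    # Timing term: penalize cues shown at different points in the movie.
    time_term = abs(src_start - tgt_start)
    return length_term + w_time * time_term

# A well-matched pair should score lower than a mismatched one.
print(pair_cost(20, 21, 1.0, 1.1))  # near-identical length and timing
print(pair_cost(20, 5, 1.0, 4.0))   # mismatched length and timing
```

A dynamic-programming aligner would then minimize the sum of such costs over a full pairing, exactly as in the length-only Gale-Church formulation.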
Not All Dialogues are Created Equal: Instance Weighting for Neural Conversational Models
Neural conversational models require substantial amounts of dialogue data to estimate their parameters and are therefore usually learned on large corpora such as chat forums, Twitter discussions or movie subtitles. These corpora are, however, often challenging to work with, notably due to their frequent lack of turn segmentation and the presence of multiple references external to the dialogue i...
THE EFFECT OF STANDARD AND REVERSED SUBTITLING VERSUS NO SUBTITLING MODE ON L2 VOCABULARY LEARNING
Audiovisual material accompanied by interlingual subtitles is a powerful pedagogical tool which can help improve the vocabulary learning of second-language learners. This study was intended to determine whether or not the mode (standard and reversed) of subtitling affects the incidental vocabulary acquisition of Iranian L2 learners while watching TV programs. Forty-five participants were random...